{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building a Multimodal RAG Pipeline over an Auto Insurance Claim\n",
    "\n",
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/multimodal/insurance_rag.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This cookbook shows how to use LlamaParse and OpenAI's multimodal GPT-4o model to parse auto insurance claim documents that contain complex tabular data. In this example, we will use an auto insurance claim template form, which contains complex tabular inputs regarding information about the location of the accident, accident description, information about vehicles of both parties, and injury information. The template is shown below.\n",
    "\n",
    "![Auto Insurance Template](https://github.com/user-attachments/assets/aadbaa5b-16d2-490f-be35-f8ee06571633)\n",
    "\n",
    "This example demonstrates how LlamaParse can be used on insurance documents, which often contains complex tabular data. We parse these tabluar PDF files into markdown-formatted tables, which can be indexed and queried over with a `VectorStoreIndex`. This can help insurance companies accelerate the process of gathering information about car accidents from insurance claim documents."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install and Setup\n",
    "\n",
    "Install LlamaIndex, download the data, and apply `nest_asyncio`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://github.com/user-attachments/files/16536240/claims.zip -O claims.zip\n",
    "!unzip -o claims.zip\n",
    "!rm claims.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set up your OpenAI and LlamaCloud keys."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"<Your OpenAI API Key>\"\n",
    "os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"<Your Llamacloud API Key>\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Code Implementation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set up LlamaParse. We want to parse the PDF files into markdown, translating the tabular data into markdown tables. To ensure accuracy, we will use the GPT-4o multimodal model to parse the PDFs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_parse import LlamaParse\n",
    "\n",
    "parser = LlamaParse(\n",
    "    result_type=\"markdown\",\n",
    "    parsing_instruction=\"This is an auto insurance claim document.\",\n",
    "    use_vendor_multimodal_model=True,\n",
    "    vendor_multimodal_model_name=\"openai-gpt4o\",\n",
    "    show_progress=True,\n",
    ")\n",
    "\n",
    "CLAIMS_DIR = \"claims\"\n",
    "\n",
    "\n",
    "def get_claims_files(claims_dir=CLAIMS_DIR) -> list[str]:\n",
    "    files = []\n",
    "    for f in os.listdir(claims_dir):\n",
    "        fname = os.path.join(claims_dir, f)\n",
    "        if os.path.isfile(fname):\n",
    "            files.append(fname)\n",
    "    return files\n",
    "\n",
    "\n",
    "files = get_claims_files()  # get all files from the claims/ directory\n",
    "md_json_objs = parser.get_json_result(\n",
    "    files\n",
    ")  # extract markdown data for insurance claim document\n",
    "parser.get_images(\n",
    "    md_json_objs, download_path=\"data_images\"\n",
    ")  # extract images from PDFs and save them to ./data_images/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# extract list of pages for insurance claim doc\n",
    "md_json_list = []\n",
    "for obj in md_json_objs:\n",
    "    md_json_list.extend(obj[\"pages\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create helper functions to create a list of `TextNode`s from the markdown tables to feed into the `VectorStoreIndex`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from pathlib import Path\n",
    "import typing as t\n",
    "from llama_index.core.schema import TextNode, ImageNode\n",
    "\n",
    "\n",
    "def get_page_number(file_name):\n",
    "    \"\"\"Gets page number of images using regex on file names\"\"\"\n",
    "    match = re.search(r\"-page-(\\d+)\\.jpg$\", str(file_name))\n",
    "    if match:\n",
    "        return int(match.group(1))\n",
    "    return 0\n",
    "\n",
    "\n",
    "def _get_sorted_image_files(image_dir):\n",
    "    \"\"\"Get image files sorted by page.\"\"\"\n",
    "    raw_files = [f for f in list(Path(image_dir).iterdir()) if f.is_file()]\n",
    "    sorted_files = sorted(raw_files, key=get_page_number)\n",
    "    return sorted_files\n",
    "\n",
    "\n",
    "def get_text_nodes(json_dicts, image_dir) -> t.List[TextNode]:\n",
    "    \"\"\"Creates nodes from json + images\"\"\"\n",
    "\n",
    "    nodes = []\n",
    "\n",
    "    docs = [doc[\"md\"] for doc in json_dicts]  # extract text\n",
    "    image_files = _get_sorted_image_files(image_dir)  # extract images\n",
    "\n",
    "    for idx, doc in enumerate(docs):\n",
    "        # adds both a text node and the corresponding image node (jpg of the page) for each page\n",
    "        node = TextNode(\n",
    "            text=doc,\n",
    "            metadata={\"image_path\": str(image_files[idx]), \"page_num\": idx + 1},\n",
    "        )\n",
    "        image_node = ImageNode(\n",
    "            image_path=str(image_files[idx]),\n",
    "            metadata={\"page_num\": idx + 1, \"text_node_id\": node.id_},\n",
    "        )\n",
    "        nodes.extend([node, image_node])\n",
    "\n",
    "    return nodes\n",
    "\n",
    "\n",
    "text_nodes = get_text_nodes(md_json_list, \"data_images\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Index the documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import (\n",
    "    VectorStoreIndex,\n",
    "    StorageContext,\n",
    "    load_index_from_storage,\n",
    "    Settings,\n",
    ")\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "from llama_index.llms.openai import OpenAI\n",
    "\n",
    "embed_model = OpenAIEmbedding(model=\"text-embedding-3-large\")\n",
    "llm = OpenAI(\"gpt-4o\")\n",
    "\n",
    "Settings.llm = llm\n",
    "Settings.embed_model = embed_model\n",
    "\n",
    "if not os.path.exists(\"storage_insurance\"):\n",
    "    index = VectorStoreIndex(text_nodes, embed_model=embed_model)\n",
    "    index.storage_context.persist(persist_dir=\"./storage_insurance\")\n",
    "else:\n",
    "    ctx = StorageContext.from_defaults(persist_dir=\"./storage_insurance\")\n",
    "    index = load_index_from_storage(ctx)\n",
    "\n",
    "query_engine = index.as_query_engine()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Example queries are shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Michael Johnson filed the insurance claim for the accident that happened on Sunset Blvd."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.display import display, Markdown\n",
    "\n",
    "response = query_engine.query(\n",
    "    \"Who filed the insurance claim for the accident that happened on Sunset Blvd?\"\n",
    ")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Ms. Patel's accident occurred on March 10, 2023, at approximately 9:15 AM in the Boise Towne Square Mall parking lot. She was heading west at a parking space and, after checking her mirrors and blind spots, did not see any approaching vehicles. However, Michael Chen, the driver of another vehicle, was driving too fast through the parking lot and failed to stop in time, resulting in a collision with Ms. Patel's vehicle. This caused significant damage to the rear bumper and trunk of her car."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = query_engine.query(\"How did Ms. Patel's accident happen?\")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Mr. Johnson's red sedan, a 2020 Honda Accord, was damaged on the front passenger side, including a dented fender and a broken headlight. The estimated repair cost is $3,500."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = query_engine.query(\"How was Mr. Johnson's red sedan damaged?\")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Mr. Doe's Honda Accord sustained damage to the front bumper, hood, fenders, head/tail lights, windshield, and doors."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = query_engine.query(\"How was Mr. Doe's Honda Accord damaged?\")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "The witness for Ms. Patel's accident is Sophia Rodriguez. She can be contacted at 5554567890."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = query_engine.query(\n",
    "    \"Who are some witnesses for the Ms. Patel's accident and how can we contact them?\"\n",
    ")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Yes, Ms. Johnson sustained injuries. She experienced minor injuries, including a bruised knee and some whiplash."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = query_engine.query(\n",
    "    \"Did Ms. Johnson sustain any injuries? If so, what were those injuries?\"\n",
    ")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Mark Johnson is liable for the damages from the accident on Lombard Street. He was driving a delivery van that collided with the rear of Emily Rodriguez's vehicle. In rear-end collisions, the driver who hits the vehicle in front is typically at fault because they are expected to maintain a safe distance and be able to stop in time to avoid a collision."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "chat_engine = index.as_chat_engine()\n",
    "response = chat_engine.chat(\n",
    "    \"Given the accident that happened on Lombard Street, name a party that is liable for the damages and explain why.\"\n",
    ")\n",
    "display(Markdown(str(response)))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama-parse-5ZmnAQ0r-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
